Cap analysis gene expression (CAGE) is a technique used in molecular biology to produce a snapshot of the 5' end of the messenger RNA population in a biological sample. Short fragments (usually 20-21 nucleotides long) from the very beginnings of mRNAs (the 5' ends of capped transcripts) are extracted, reverse-transcribed to DNA, PCR-amplified and sequenced. CAGE was first published by Hayashizaki and co-workers in 2003 [1].
The output of CAGE is a set of short nucleotide sequences (often called tags) with their observed counts. Using a reference genome, a researcher can usually determine, with some confidence, the original mRNA (and therefore the gene) from which a tag was extracted. The copy numbers of CAGE tags provide a simple digital measure of RNA transcript abundance in a biological sample.
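The tag-and-count output described above can be illustrated with a minimal sketch. It assumes the tags are already available as plain strings. The function names, the example sequences and the tags-per-million scaling are illustrative assumptions, not part of the published protocol.

```python
from collections import Counter


def count_tags(tags):
    """Return a Counter mapping each distinct tag sequence to its observed count."""
    # Upper-case the sequences so that "acgt" and "ACGT" are counted together.
    return Counter(tag.upper() for tag in tags)


def tags_per_million(counts):
    """Scale raw tag counts so that they sum to one million (assumed normalization)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


# Hypothetical 20-nucleotide tags; real tags would come from a sequencing run.
reads = [
    "ACGTACGTACGTACGTACGA",
    "ACGTACGTACGTACGTACGA",
    "acgtacgtacgtacgtacga",
    "TTGACCATGGCATCGATCGA",
]

counts = count_tags(reads)
normalized = tags_per_million(counts)
```

Scaling by the total number of tags makes counts from samples sequenced at different depths roughly comparable. Assigning each tag to a gene still requires mapping it to a reference genome, which this sketch does not attempt.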
Unlike the similar technique Serial Analysis of Gene Expression (SAGE, superSAGE), in which tags come from other parts of transcripts, CAGE is primarily used to locate exact transcription start sites in the genome. This knowledge in turn allows a researcher to investigate the promoter structure necessary for gene expression.
However, the CAGE protocol has a known bias: a nonspecific G can appear at the 5′-most end of CAGE tags, which is attributed to template-free 5′-extension during first-strand cDNA synthesis [2]. This can cause erroneous mapping of CAGE tags, for instance to nontranscribed pseudogenes [2].
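One simple way to account for the extra G during mapping is sketched below. This is an illustrative heuristic rather than the correction method of reference [2]. It assumes a toy reference held as a single forward-strand string and exact matching only, and the function name `locate_tag` is hypothetical.

```python
def locate_tag(reference, tag):
    """Locate a tag on the forward strand of a toy reference sequence.

    Returns (position, trimmed), where position is the 0-based index of the
    first genome-encoded base of the tag (None if the tag is not found) and
    trimmed says whether a leading G had to be removed to obtain the match.
    """
    # First try the tag exactly as sequenced.
    pos = reference.find(tag)
    if pos != -1:
        return pos, False
    # If that fails and the tag starts with G, assume the G may be the
    # nonspecific addition and retry without it.
    if tag.startswith("G"):
        pos = reference.find(tag[1:])
        if pos != -1:
            return pos, True
    return None, False


# Hypothetical reference fragment and tags, for illustration only.
reference = "TTTACGTTAGCCATGCAAT"
clean_hit = locate_tag(reference, "ACGTTAGC")
g_added_hit = locate_tag(reference, "GACGTTAGC")
no_hit = locate_tag(reference, "GGGGGGGG")
```

The sketch cannot tell an added G from a genuine genomic G: when the base upstream of the true start site happens to be G, the untrimmed tag matches one position upstream and is reported there. Resolving such cases needs more information than exact matching provides.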